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Abstract 

The Kaczmarz method is an algorithm for finding the solution to an overdetermined consis- 
tent system of linear equations Ax = b by iteratively projecting onto the solution spaces. The 
randomized version put forth by Strohmer and Vershynin yields provably exponential conver- 
gence in expectation, which for highly overdetermined systems even outperforms the conjugate 
gradient method. In this article we present a modified version of the randomized Kaczmarz 
method which at each iteration selects the optimal projection from a randomly chosen set, which 
in most cases significantly improves the convergence rate. We utilize a Johnson-Lindcnstrauss 
dimension reduction technique to keep the runtime on the same order as the original randomized 
version, adding only extra preprocessing time. We present a series of empirical studies which 
demonstrate the remarkable acceleration in convergence to the solution using this modified 
approach. 

1 Introduction 

The Kaczmarz method |18j is a popular algorithm for solving overdetermined consistent systems of 
linear equations. Due to its simplicity and speed, it has been used in a variety of applications ranging 
from tomography to digital signal processing [H[2TJ[l9]. The method uses a series of alternating 
projections to iteratively converge to the solution of Ax = b, and is therefore computationally 
feasible even for very large systems. Given an initial guess xq and denoting by a%, . . . , a m the rows 
of the m x n matrix A, each iteration of the method orthogonally projects the current estimation 
onto the next hyperplane defined as the solutions to {ai,x) = bi, chosen in a cyclic fashion. The 
algorithm can be described by the iterations: 

b[i] - (aj,x k ) 



2 



where x k is the k th iterate, b[i] (here and throughout) denotes the ith coordinate of b, and i = (k 
mod m) + 1. 

Although this technique has been used in practice for quite some time, theoretical guarantees 
on convergence were difficult to obtain [8j [T2l [T3] . It is clear that by design of the algorithm, 
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the convergence rate depends on the ordering of the rows in A. Therefore, poorly ordered rows 
can lead to slower convergence. To overcome this difficulty, the rows of A can be selected in a 
random fashion. It has been observed that this randomized version of the algorithm improves the 
convergence rate [I9j [14] , however only recently have theoretical results been obtained [2U [25[ [20] . 



1.1 Randomized Kaczmarz 

In [244 125] . Strohmer and Vershynin propose at each iteration to randomly select a row of A 
with probability proportional to the Euclidean norm of the row. The randomized Kaczmarz (RK) 
method can thus be described by 

b\p(i)\ - (a p{i ),x k ) 
Xk+i = x k ^ |i a p{i) , (1.1) 

\\ a p{i) lb 



where p(i) takes values in {1, . . . , m} with probabilities vffia 2 ■ Here and throughout, || A\\p denotes 



||2 

the Frobenius norm of A and 1 1 - 1 1 s denotes the standard Euclidean norm or spectral norm for vectors 
or matrices, respectively. The selection rule used here is not optimal in general. The motivation for 
setting the rule according to the weight of the row norms is two-fold. First, it allows for a guarantee 
of expected exponential convergence for the Kaczmarz method [25] . Second, it is a computationally 
efficient strategy since often these values will be known approximately or exactly, and will only 
need to be computed once. A selection rule of this type is of course also related to the idea of 
preconditioning the matrix A by scaling its rows. Although other diagonal preconditioners may 
certainly perform better in general, finding such an optimal preconditioner is itself an optimization 
problem of high complexity. With the above selection strategy, the following exponential bound 
was shown in \2A\ [25] for the convergence in expectation of this randomized method: 

|2 / f-, l\K . |2 



E\\x k -x\\i <{1--) ||z -a:|||, (1.2) 

where R = ||A _1 || 2 ||^4||^ and xq is an arbitrary initial estimate. Since we will always assume that 
A has full column rank, the norm || A" 1 1| = inf{M : M||Ac|| 2 > \\x\\ 2 for all x} is well-defined. 
This bound is essentially independent of the number of rows of A. Moreover, the bound shows that 
for well conditioned matrices A, the RK method yields expected exponential convergence to the 
solution in just O(n) iterations (see Section 2.1 of [25]). Since each iteration consists of a single 
projection taking O(n) time, this shows that the overall method has 0(n 2 ) runtime, which is clearly 
superior to other methods such as Gaussian elimination which takes 0(mn 2 ), especially when the 
system is very large. The discussion in [25] shows that the randomized Kaczmarz method often 
even outperforms the celebrated conjugate gradient method. For example, when A is a Gaussian 
matrix and m > 3ra, the RK method provably requires fewer computations, and empirical studies 
show that this improvement is substantial. See Section 4.2 of [25] for details. 

The empirical and theoretical benefits of this approach lead one to ask whether it is also ac- 
curate in the more realistic case when noise is present. One may thus consider the (now possibly 
inconsistent) system Ax ~ b + w where w is an arbitrary error vector that has been added to the 
consistent system Ax = b. It is shown in [20J that in this case we have exponential convergence to 
the solution within an error factor: 

1 \ fe/2 , 



^\\Xk - X\\2 < \1 - —J \\xq\\2 + VRj, 
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where R is the same as above and 7 = maxj 




. It is also shown that this bound is sharp and is 



attained even for simple examples |20j . 
1.2 Modified approach 

To further improve the convergence rate of the RK method, we suggest a different approach to 
selecting the rows of A. Although our ideas should also apply seamlessly to the case when noise 
is present and the system becomes inconsistent, in this work we only consider the noiseless case 
and leave a detailed analysis in the presence of noise for future work. Since the projections in 
the algorithm (TTTTj) are orthogonal, it can be seen that the optimal projection in the kth. iteration 
is the one that maximizes ||sGfc+i — Xk\\2- By definition of the iterations (ll.lj) . one can calculate 
these quantities by computing inner products between the rows a% of A and the current iterate x^. 
Since computing one inner product requires O(n) operations, we clearly cannot afford to perform 
more than a constant number of these in a given iteration. Our approach, therefore, is to project 
the rows of A onto a lower dimensional space in such a way that the geometry of the vectors is 
approximately preserved. We then perform calculations of the form (II. ID with respect to these low 
dimensional vectors, and select the best projection. 

By construction, our modified algorithm will converge to the solution of Ax = b in the worst 
case as fast as the standard RK method. In practice, we expect the convergence to be much 
faster, especially when the lower dimension d onto which we project the rows is not too small. The 
improvement in each iteration can be quantified in terms of d and the current estimation as we 
demonstrate in Section [31 In Section [4] we demonstrate that empirically our technique outperforms 
the standard RK method in terms of convergence rate. The runtime of this modified algorithm 
of course depends on the dimension d onto which we project the rows of A. If d <C n, then each 
iteration will require 0(dn) operations, meaning that the overall runtime for expected exponential 
convergence becomes at most 0(dn 2 ). There is a tradeoff in the choice of the dimension d. If d is 
small, then the runtime per iteration remains small. However, if d is too small then our technique 
reduces to the standard RK method, and we will not gain in the convergence rate. In the next 
section we discuss the runtime and implementation, and show why the selection of d on the order 
of logn is the right choice. This gives a worst case runtime of 0(n 2 logn), although we expect a 
much faster convergence as we also see in simulations. This worst case runtime however, is still the 
same as that of the standard RK method up to the log factor. 

2 Implementation and Runtime 

Since the projections in the algorithm are orthogonal, one easily sees that the optimal projection 
in the fcth iteration would be the one that maximizes — Xfc||2, or equivalently, the term 



Unfortunately, calculating this term takes O(n) time, so that to keep the overall runtime at 0(n 2 ) 
one can only afford to make this computation a constant number of times. However, if we could 
significantly reduce the dimension of the vectors a, and Xk used in the calcuation, then more of these 
calculations could be done at each iteration, and the best out of those computed could be chosen, 
leading to accelerated convergence. Our idea is thus to project the vectors onto a low dimensional 



\b[i] - (a,i,Xk)\ 




Qi 2 
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space such that the geometry is preserved. This will allow approximation of the inner products 
{ai,Xk) and the norms 1 1 ct^ 1 1 2 (if they are not known a priori) from the projected data. To do this, 
we will consider a Johnson-Lindenstrauss type projection. The well-known Johnson-Lindenstrauss 
Lemma [17] states that with high probability, there is a projection of a finite set of points onto 
a space logarithmic in the number of points that approximately preserves geometry. This can be 
summarized as follows. 

Lemma 2.1 (Johnson-Lindenstrauss |17j ) Let S > and let S be a finite set of points in W 1 . 
Then for any d satisfying 

d>C 1 -^, (2.2) 
there exists a Lipschitz mapping $ : W 1 -»• R d such that 

(1 _ 5 )\\ Si - Sj f 2 < \Msi) - $( Sj )f 2 < (1 + S)\\ Si - Sj \\l (2.3) 
for all Si,Sj £ S, where C is an absolute constant. 

Remark. The value of C in which this lemma holds depends on the distribution from which 
<]? is created. When $ is Gaussian, one has C < 8 [7]. 

Although this lemma as stated only guarantees existence of such a mapping, in their proof 
the map <J> is chosen as the projection onto a random (f-dimensional subspace of W 1 . This result 
has been improved over time and now one can easily construct such a (random) projection which 
preserves the geometry (see e.g. [UEUE]). Indeed, it is shown in [1] that whenever a distribution 
satisfies certain moment conditions, the dxn random matrix $ whose entries are chosen i.i.d. with 
respect to that distribution will satisfy (|2.3p with high probability provided d satisfies (|2.2|) . The 
Guassian distribution, for example, satisfies these moment conditions; therefore the matrix with 
i.i.d. Gaussian entries will preserve geometry with high probability. Recently there has been work 
on constructing transforms which satisfy (12. 3p but that also provide a fast multiply (see e.g. [2lll5j). 
For example, Ailon and Chazelle construct in [2] a C5~ 2 log\S\ x n transform $ satisfying (|2,3p 
with high probability whose multiply requires roughly ralogn + 5~ 2 log 3 \ S\ operations. Hinrichs 
and Vybiral provide a multiply using nlogn operations when $ has slightly more rows, on the 
order of log 2 |5|. Even more recently, Ailon and Liberty show in |3] that when the d x n <E> is the 
composition of a randomly subsampled Hadamard (or Fourier) matrix and a random sign matrix, 
then $ satisfies (|2.3p with high probability when d = C5~ 4 log \S\ log 4 n. This matrix has an n log n 
multiply, and so this result provides an optimal fast JL transform, up to the power —4 on S and 
the polylogarithmic dependence on n. 



2.1 Implementation 

In our setting, the Johnson-Lindenstrauss Lemma allows us to project the rows of A as well as the 
estimations x^ onto a space of substantially lower dimension. This will then let us approximately 
calculate the terms in (|2.ip using far fewer operations, which we can use to decide on which 
hyperplane to project the current estimate. We note that the projection of the rows of A will be 
performed offline, adding to the preprocessing time, whereas the projection of the estimation Xk 
will be done at each iteration. We choose to use an analagous strategy as in the RK algorithm 
for our row selection; that is, using the weight of the row norms. This choice yields a worst case 
convergence rate which is the same as that guaranteed by the original RK method (see Remark 
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2 below). This leads to the following modified randomized Kaczmarz method, called Randomized 
Kaczmarz via Johnson-Lindenstrauss (RKJL) which can be summarized as follows. 



Randomized Kaczmarz via Johnson-Lindenstrauss (RKJL) 



Input: m x n matrix A, coefficient vector b £ M. m , parameter d, initial estimate xq 
Output: Approximate x solving Ax = b 

Initialize: Set k = 0, create a d x n Gaussian matrix $ and set at = tfrdj. Repeat the following 
O(n) times: 

Select: Select n rows so that each row a-i is chosen with probability II^HI/II^Hf as in 11.11 For 
each row selected, calculate 

\b[i] - (oti,^x k )\ 



7< 



«t 2 



and set j = argmaxj 7$. 
Test: For aj and the first row a>i selected out of the n, explicitly calculate 



* \b\j] ~ (aj,xk)\ , * [&W ~ (ai,x k )\ 

tj 11 11 dllU ^fi 11 11 

Il a jll2 \\ai\\2 



If 7f > 7*, set j = L 
Project: Set 

Update: Set k = k + 1. 



6|jf] - (aj-,Xfc) 

a^fc+i = ^fc H n — p a j- 

Il a ill2 



Remarks. 1. We show in Section [3] that d = O(logn) will be enough to approximately preserve 
geometry and thus give convergence improvements. Of course greater values of d may give greater 
improvements on convergence, at the expense of more computational cost at each iteration. 

2. In the Test stage of the algorithm, we see that in addition to approximating the n inner 
products, we also exactly calculate the inner product of a randomly selected row and also the row 
that was chosen. This will guarantee that the convergence is not slowed by any drastic consequences 
of the error in the approximations, and does not affect the overall runtime. 

3. We note that the initialization step may of course be computationally expensive, as is 
calculating the probabilities p(i) in the standard version (jl.ip . However, this need only be done 
once, and thus this version of the algorithm will be beneficial for situations in which the same matrix 
A is used in many problems. This is the case for many applications such as the wave-scattering 
problem [22] and structural mechanics problems [11] : see [5j[23] for others. 

We next turn to an analysis of this modified method. In Section H] we provide numerical results 
demonstrating the improved convergence rate. 
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2.2 Runtime 



As discussed above and in [24} 125]. the standard randomized Kaczmarz method converges expo- 
nentially fast to the solution in O(n) iterations, resulting in a total runtime of 0(re 2 ). The RKJL 
algorithm will thus also converge (in expectation) in at most O(n) iterations, and so it remains to 
calculate the runtime of each iteration. 

The first step in an iteration is to calculate Since <3? is a d x n matrix, this computation 

in general takes 0(nd) time. Next, each lives in a d dimensional space, so that calculating 
inner products in the selection step costs only 0(d). Since we calculate n such inner products, the 
total calculation time is 0(nd). The projection and update steps clearly take O(n) and 0(1) time, 
respectively, leading to an overall runtime per iteration of 0(nd). Therefore, after 0(n) iterations, 
we see that the overall runtime of the algorithm is 0(n 2 d). Lemma l3.1l b elow shows that d can be 
chosen on the order of logn. Therefore, RKJL converges exponentially fast in at most 0(re 2 log re) 
time, which is the same as the runtime of standard RK, up to the log factor. In practice, the 
runtime is much faster as we show in Section [U 

Because the algorithm will need to use roughly re 2 rows of A (assuming n 2 < m), the matrix 
$ will have to be applied to at least this many vectors, resulting in a 0(n 3 d) initial cost. If the 
algorithm is used repeatedly for various problem instances over the same matrix A, then one may 
wish to apply to all the rows of A, yielding a 0(mnd) cost. Even when d = O(logn) this is 
of course substantial, but for applications in which the algorithm will be used many times, this 
one-time cost will become minimal. 

This initial computational cost occurs in other methods as well, for example, in submatrix 
selection algorithms that take 0(mn 2 ) to select a well represented nlogn x n submatrix of A (see 
e.g. OlO]). These methods randomly select the submatrix (according to a particular probability 
model), and if the submatrix selected represents A well, then it can be used to solve the system 
Ax = b. However, there is some probability that the subsystem cannot be solved, in which case 
it must be reselected again, and so on. This is in contrast to RKJL, for which we are always 
guaranteed convergence, and most likely with improved expected convergence rate. The choice of 
method will of course depend on the application, and in some cases it may even be beneficial to 
use some combination of these approaches. 

3 Analytical Justification 

We next analyze how the Johnson-Lindenstrauss Lemma is utilized by our method. We will assume 
here that the system is real- valued and homogeneous (ie. Ax = 0), that the rows of A all have unit 
norm, and that the initial guess xq also satisfies ||xo||2 < 1- These assumptions are of course not 
necessary, but will make the analysis simpler. We discuss the case where the row norms may be 
far from equal in Remark 3 below. We begin with an easy lemma which shows that the geometry 
of the vectors used in the RKJL method is approximately preserved. 

Lemma 3.1 Let $ be the rex d (Gaussian) matrix with d = C8~ 2 log(n) as in the RKJL method. 
Set 7, = ($aj, $>Xk) also as in the method. Then \~fi — (aj, x^) \ < 25 for all i and k in the first O(re) 
iterations of RKJL. 

Remark 1 This lemma shows that with d chosen on the order o/0(logre), the geometry of the 
vectors involved in the RKJL method is approximately preserved. In practice, d should thus be 
chosen of this order to gain improvements in convergence. 
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Proof. We employ the Johnson-Lindenstrauss Lemma (Lemma 12. ip with S consisting of all Oj and 
x k in the first O(n) iterations of RKJL. Then since |<S| < n 2 , the condition ()2.2p is satisfied, and 
so (|2.3p holds for all a, and used in the algorithm. By this and the parallelogram law, we have 

= - (||$a; + <&X k \\l - \\&Oi - ®x k \\l 

< -((1 + <5)||ai + xfclH - (1 - 5)||ai - XfcUl 

= (ai,x k ) + jS(\\ai + x k \\l + \\a,i - x k \\l) 

< (ai,x k ) + 25. 

Similarly we have that ji > (ttj, x k ) — 25, which completes the claim. ■ 

This shows that the terms 7j used for selection in the algorithm are approximately equal to the 
actual desired values (a^, x k ). Thus for 5 small, the RKJL algorithm makes well educated decisions 
at each iteration which allows for quicker convergence (see Theorem 13.21 below) . This also shows 
that when the estimation x k becomes very close to the true solution x = 0, the error 5 begins to 
dominate and improvements may no longer be expected. However, this does not pose a problem 
since it only occurs when the estimate is already approximately x. 

It is clear from construction of the RKJL algorithm (and especially in light of Remark 2 above), 
that convergence using RKJL is at least as fast as the standard randomized version. Moreover, 
when the error produced by applying is small, the RKJL method will project onto the "best" 
hyperplane out of those it selected in that iteration. Since the probability of choosing this "best" 
row when selecting only a single row is strictly less than the probability of choosing that row when 
a set of rows is selected, this implies that the only case in which RKJL would not provide a strictly 
faster convergence rate is when (ai,x k ) = (aj,x k ) for all rows a^, a,j selected in the kth. iteration. 

Given a current estimate x k , one can explicitly compare the expectation of the improvement 
the next estimation provides, for both the RKJL and standard randomized methods. First observe 
that if P denotes the projection in the kth iteration, then x k _i — x k resides in the kernel of P, and 
is thus orthogonal to the space onto which P projects. This space contains x k — x since x is the 
solution to all equations Ax = b and x k = Px k -\. Therefore, x k — x and x k -\ — x k are orthogonal, 
implying that 

\\x k - x k+ i\\l = \\x - x k \\l - \\x - x k+1 \\l. (3.1) 

The relation ()3.ip shows that the larger \\x k — x k _i\\2, the bigger the improvement made in that 
iteration. We thus fix an estimation x k and analyze the expectation of \\x k — Xfc+i||2 for the RKJL 
method versus the standard one. For convenience, we again consider the real and homegenous case 
(ie. when 6 = 0), and assume the rows of A have unit norms. We then have the following result. 

Theorem 3.2 Fix an estimation x k and denote by x k+ \ and x^ +1 the next estimations using the 
RKJL and the standard RK method, respectively. Set 7* = \(aj,x k }\ 2 and reorder these so that 
7* > 72 > • • • > 7m- Then when d = C5~ 2 log n, 



E||xfc+i — 2&H2 < min 



m . 

E||xfc +1 - - (pj - + 25 i - 

j=l m 
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where 



n ~ lJ j<m-n+l 



(. 0, j > m — n + 1 
are non-negative values satisfying Yl^jLiPj = 1 anc ^ Pi > P2 > • • • > = 0. 

Proof Since we assume the rows of A have unit norm and that b = 0, we see that 7j is precisely 
the value of — a^fclli if t ne algorithm were to select row J. We begin by examining the kth 

RKJL iteration. 

Let C denote the set of rows chosen in the selection step, so that \C\ = n. If exact geometry were 
preserved (i.e. 5 = 0), then the method would simply select the index of the largest A* contained 
in C However, due to the error induced by even when the "best" row is selected to be in C, the 
algorithm may not choose this row for the projection. 

We thus define sets Tj for j = 1, ... to, which consist of rows which could be "confused" in this 
way, 

T j = {i ■ IT* - 7j*l < 2 <5}> and fij = mm{j* : i £ Tj}. 

From Lemma 1331 if j £ £ and 1, . . . ,j — 1 ^ £, then the worst row we could choose is one that 
would give ||xfc+i — £fc||| = [ij. Therefore, 

E||x fc+ i - acfc ||| = Ej7} 

> /iiP(l G C) + // 2 P(2 G C n 1 i C) + . . . + /i m _ n+ iP(l, 2, . . . , m - n $ C) 

m—n+l 

= ^(jG£ni,...,j-i^4 

Since each row of ^4 has equal norm, each row of A is equally likely to be selected in C Thus 

tm-j\ 

P(jG£ni,...,j-i^£) = ^. 

Therefore, we have 

m— n+l 

- x k \\l > ^ //jPj- 
j'=i 

Lemma 13.11 and the definition of Tj guarantee that \Xj > max(7| — 25,0). This along with the 
fact that YliPj = 1 yields 

m— n+l 

n\x k+l - Xk \\i > ( ito) - 26 - ( 3 - 2 ) 

Finally, since all the rows of ^4 have the same norm, the standard RK method selects each row 
uniformly at random, so that 



EK +1 -x fe ||| = (3.3) 

3=1 



s 



Combining ()3.3[) with (|3.2j) and (|3.ip we have 

E||x fc+ i - x\\l = E\\xk - x\\l - \\x k - Xk + i\\l 

m—n+l 



<E\\x k -x\\ 2 2 - ( £ 7>i +25 



EK +1 -ar||l + E||x^ +1 -a; fc ||l-( £ 7>jj + 2<5 



m— ra+1 



m 1 

E||4 +1 -^III-E(^--)^ + 2^ 



This along with the fact that by construction of RKJL, E||xfc+i — < E||x£ +1 — xjH, completes 
the claim. ■ 

Remarks. 1. Theorem 13.21 gives a lower bound, which shows improvements in the "worst case", 
when the error induced by the Johnson-Lindenstrauss projection causes the method to choose a row 
of A in the worst way. Numerical experiments (as seen in the next section) demonstrate substantial 
improvements in the convergence rate. 

2. Note that since the sequences {7j} and {pj} are non-increasing and Y^JLiPj = 1 = Sj=i ™> 

the sum ft = 5Zj=i {Pj ~ m)^j * s non-negative. Furthermore, ft = only when 7* = 7! = . . . = 7^. 
Indeed, if 7* > 7* even for just one pair i < j, then for 5 small enough, the RKJL method provides 
strict convergence improvement. Knowledge of x k and A would allow one to precisely calculate the 
improvement. 

3. The theorem is proven under the assumption that the rows of A have the same norm. The 
same argument holds (with different values of pj ) still showing improvement when this assumption 
does not hold, and numerical experiments show similar results in either case. Although the analysis 
of exact improvement in these other cases may quickly become quite complicated, we recall that 
by construction the RKJL method offers overall improvement in any case. 

It is also helpful to identify some particularly interesting scenarios, for example, when one or 
a few rows are substantially of larger norm. In this case the standard RK algorithm (noticing 
that each row is selected independently of the previous selections) will choose this row repeatedly 
with high probability. Clearly this may slow convergence (especially when these rows are highly 
correlated), and although there is still guaranteed exponential convergence, R in this case is much 
larger so that the guaranteed rate is slower. In RKJL, although the guaranteed worst case rate 
is the same, these large rows are likely not to be selected when their contributions toward the 
solution is minimal. This will again speed up convergence, since in the language of Theorem 13.21 
the A£ corresponding to the rows which are highly correlated with the previous projection will be 
such that k ~ m. This means that the same argument as in Theorem 13.21 will hold for all X*- with 
j < k ~ m. Although this means the improvement in RKJL may not be as substantial as in the 
case of equal normed rows, we will still see a large improvement. 

These ideas highlight the fact that the selection strategy is not optimal in general. For example, 
if there are k <C m rows which are highly uncorrelated but have very small norm, and m — 
k ~ m equal rows with substantially larger norm, neither RK nor RKJL will perform well. Of 
course if this were reversed, with the uncorrelated rows having large norm and the equal rows 
having small norm, the selection strategy will yield excellent convergence in both RK and especially 
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RKJL. We emphasize again that the choice of the selection rule is certainly not optimal in general, 
but is computationally efficient and allows for provable expected exponential rate of convergence. 
Choosing an optimal selection strategy for a given system is itself a problem of high complexity. If 
one is using RKJL and investing preprocessing time to perform dimension reduction, one may also 
wish to simply normalize the rows to avoid such difficulties. 

Theorem 13.21 implies the following corollary, showing the improved convergence using RKJL 
when exact geometry is preserved (i.e. when 5 — > 0). 

Corollary 3.3 Fix an estimation and denote by x^+i and x* k+l the next estimations using 
the RKJL and the standard method, respectively. Set 7* = \(aj,Xk)\ 2 and reorder these so that 
7* > 72 > • • • > 7m- Then when exact geometry is preserved (5—^0), 

E\\x k+1 - x\\l < E\\x* k+1 - x\\l - (Pj ~ — p*j- 

j= i m 

4 Numerical Results 

We now demonstrate improved convergence using the RKJL method. The first experiment we 
run is in the computationally infeasible situation where we do not use the Johnson-Lindenstrauss 
projection, but simply choose the best row out of the randomly selected n rows. This experiment 
will demonstrate the improved convergence using RKJL when the error 5 induced by $ goes to 0. 
This is the best improvement one can hope for in RKJL. When the problem sizes grow very large, 
the effect of the 5 2 term in (|2.2p becomes minimal, so it may be realistic to take 5 quite small. 
For these simulations we use a 60000 x 1000 matrix with Bernoulli entries and use a homogeneous 
system with an initial estimate chosen uniformly at random on the sphere. We see in Figure Q] that 
the convergence in this scenario is significantly improved. 
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Figure 1: ^2-Error (y-axis) as a function of the iterations (x-axis). The dashed line is standard 
Randomized Kaczmarz, and the solid line is the modified one, without a Johnson-Lindenstrauss 
projection. Instead, the best move out of the randomly chosen n rows is used. Note that we 
cannot afford to do this computationally. 
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Next we run simulations using a Johnson-Lindenstrauss matrix <£. We generate $ and the 
system Ax = b the same as above, but now run RKJL using various values of d. In Figure [2] we 
see exactly what we expect, that with higher values of d (corresponding to lower 5 values), we have 
much quicker convergence. The speedup in convergence using larger d needs to be weighed against 
the increase in computation per iteration, as was discussed in Section [2J Since right now there are 
no theoretical guarantees on precisely how the convergence is affected by larger d, this comparison 
should be done empirically. Finally, it is clear that as m and n grow large, the impact on d of 
forcing 6 to be small becomes minimal. Thus for very large systems, using d = O(logn) will give 
convergence that looks more like that in Figure [TJ 
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Figure 2: ^2-Error (y-axis) as a function of the iterations (x-axis) for various values of d with 
m = 60000 and n = 1000. 
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